Names

Introduction

We intent to find out what makes a start-up attractive for acquisition, and what drives its acquisition price up or down.

In this study, we plan to analyze the acquisition data of more than 2000 companies from all over the world, which are founded from year 1900 to year 2014. Particulary, we have interests in exploring what factors impact the acquisition price of a company, and trying to predict the company’s acquisition price using regression. The data set has 2473 rows of data.

We consider the following variable as the response variable:

variable name description data scoure data type variable type # of missing records
aquisition_price_amount Amount paid for aquisition acquisitions.csv int numerical 0

We consider the following 14 variables as the potential predictor:

variable name description data scoure data type variable type # of missing records
category_code Entity category objects.csv string categorical 475
normalized_name Normalized entity name objects.csv string categorical 0
logo_width Logo width objects.csv float numerical 0
logo_height Logo height objects.csv float numerical 0
description Description of the entity objects.csv string categorical 1380
country_code Country code objects.csv string categorical 514
state_code State code objects.csv string categorical 980
city City name objects.csv string categorical 580
region Region name objects.csv string categorical 1110
investment_rounds Number of investment round participated in objects.csv int numerical 0
investment_companies Number of companies invested in objects.csv int numerical 0
milestones Number of milestones the entity has objects.csv int numerical 0
relationships Number of relationships the entity has objects.csv int numerical 0
acquired_at Date of deal acquisition.csv timestamp numerical 0

Background information of the data set

The database comes from a Kaggle public dataset about startup investments published by Justinas Cirtautas at the end of 2019. The dataset covers more 450 thousand company information from all over the world till 2013. Among those, about 10 thousand companies acomplished their acquisition successfully, while 80% of them got 0 dollars in their acquisition.

The entire dataset provides 11 cvs file with 154 columns, which covers six aspects of the startup ecosystem including organizations, individuals, company news, funding rounds, acquisitions, and IPOs. We populated all the 11 csv files. More detials can be found in the appendix at the end of this proposal. Considering the missing data rate, we determine to focus on only two files: acquisition.csv and objects.csv.

  • acquisition.csv provides the acquisition details, including the acquisition amount, currency and the deal date.
  • objects.csv provides the basic company profiles, which can be used as the predictor of our study.

Purpose statement

Venture capital investments hits 9.5 billion U.S. dollars in the internet industry in the United States, as of first quarter 2020. Other leading VC sectors in terms of investment were healthcare, and software. Inspired by the florish startup world, we want to investigate what factors can impact the company price during acquisition.

In this study, we select the price of the company for acquisition as our ‘Y’, the response variable. Our target is to find a set of predictors and use regression method to predict our Y. We checked all dataset and select 14 varaibles as our potential predictors. Later on we will do feature engineering to determine our X’s in the regression.

Meanwhile, we also want to know some interesting questions, like, whether the company name and the logo size has any impact on the acquisition price, or how the acquisition distributed by geography.

Loading the data into R

We merged two csv file to get our final database startups.csv. For the details in data cleaning, please refer to the Appendix at the end of the proposal.

Below is the snapshot to show we load our data into R successfully. We are including a subset of the columns for simplicity, because some columns have very long text entries.

sample_data = read.csv("startups.csv")


knitr::kable(head(sample_data[, c(
  "aquisition_price_amount",
  "city",
  "funding_total_usd",
  "category_code",
  "funding_rounds",
  "milestones",
  "relationships"
)]))
aquisition_price_amount city funding_total_usd category_code funding_rounds milestones relationships
20000000 Culver City 0 games_video 0 0 6
47500000 Mountain View 5000000 web 1 3 14
15000000 San Francisco 3000000 games_video 1 2 6
262500000 Shavano Park 274999 hardware 1 0 4
36500000 Nashville 15286415 photo_video 3 1 2
100000000 San Mateo 28000000 hardware 1 3 7
<<<<<<< HEAD

Explory Anlaysis

Univarite Analysis

# read the dataset
data0 = read.csv('startups1.csv', header = TRUE)

First, we checked the distribution of aquisition price.

hist(data0$aquisition_price_amount, breaks = 100)

As we can see from the plot above, the distrbution is highly right skewed. Then we use the box plot to check if there is any outliers.

boxplot(data0$aquisition_price_amount)

As we can see there is one record in the dataset with extremely high acquicition price. We checked the database further and found its acqusition price is 2600 billion, which equals the valuation of google and amazon. So we think this one should be an database error, we will exclude this database in our further analysis.

data1 = data0[data0$aquisition_price_amount < 10^12,]
boxplot(data1$aquisition_price_amount)

After excluding the error in our database, the distribution of aquisition price is still highly right skewed. We will consider the log transformation on the aquisition price.

data1["logprice"] = apply(data1["aquisition_price_amount"],2,log)
hist(data1$logprice, breaks = 100)

boxplot(data1$logprice)

=======

Methods

Data Exploration (Yu)

Univarite Analysis

# read the dataset
data0 = read.csv('startups1.csv', header = TRUE)

First, we checked the distribution of aquisition price.

hist(data0$aquisition_price_amount, breaks = 100)

As we can see from the plot above, the distrbution is highly right skewed. Then we use the box plot to check if there is any outliers.

boxplot(data0$aquisition_price_amount)

As we can see there is one record in the dataset with extremely high acquicition price. We checked the database further and found its acqusition price is 2600 billion, which equals the valuation of google and amazon. So we think this one should be an database error, we will exclude this database in our further analysis.

data1 = data0[data0$aquisition_price_amount < 10^12,]
boxplot(data1$aquisition_price_amount)

After excluding the error in our database, the distribution of aquisition price is still highly right skewed. We will consider the log transformation on the aquisition price.

data1["logprice"] = apply(data1["aquisition_price_amount"],2,log)
hist(data1$logprice, breaks = 100)

boxplot(data1$logprice)

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

The distributino of log aquisition price is relatively sysmetric. From the boxplot we can still find there are many records is outside of the 1.75 IQR. But we think we can start from here to build our model.

Then we checked the precdictor in the database one by one. For category_code, this is a categorical varaible.

summary(data1$category_code)
##      advertising        analytics       automotive          biotech 
##               84                3                3              263 
##        cleantech       consulting        ecommerce        education 
##               39               32               50                6 
##       enterprise          fashion          finance      games_video 
##              115                3               20               88 
##         hardware           health      hospitality            local 
##               78               10                7                1 
##    manufacturing          medical        messaging           mobile 
##               21                4                5              117 
##            music         nanotech  network_hosting             news 
##                3                1               64                9 
##            other      photo_video public_relations      real_estate 
##               84                5               74                6 
##           search         security    semiconductor           social 
##               19               36               71                4 
##         software           sports   transportation           travel 
##              403                2                3                4 
<<<<<<< HEAD
##              web             NA's 
##              260              475
plot(data1$category_code, data1$logprice)

======= ## web NA's ## 260 475
plot(data1$category_code, data1$logprice)

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

The records concentrated in several major levels of category_code. As for the boxplot between category_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

Then we turn to the name of the company. We extract the length of the company name as a feature.

data1["name_length"] = apply(data1["normalized_name"],2,nchar)
plot(data1$name_length, data1$logprice)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

After check the relationship between length of the company name, we found if the company name has more than 30 characters, the acquisition price tend to decrease. so we plan to create a dummy variable, company name lenght > 30 or not, and add it to our candidate predictor list.

Then we check the logo width.

plot(data1$logo_width, data1$logprice)
fit = lm(logprice ~ logo_width, data1)
abline(fit, col = "red")
<<<<<<< HEAD

For the plot we found in fact, quite amount of the logo_width is 0. Then we guess, maybe in this column, 0 means missing value. so we decide to remove that part and replot the above figure.

=======

For the plot we found in fact, quite amount of the logo_width is 0. Then we guess, maybe in this column, 0 means missing value. so we decide to remove that part and replot the above figure.

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
sum(data1$logo_width == 0)
## [1] 502
temp = data1[data1$logo_width > 0, ]
plot(temp$logo_width, temp$logprice)
fit = lm(logprice ~ logo_width, temp)
abline(fit, col = "red")
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

It turns out there are 502 records in total has 0 logo_width. After removing them, the relationship between logo_width and log price become not that significant. considering the maximum logo_width is more than 5000 times of the minimum width, the data reliability becomes skeptical.

We further checked the log_height.

plot(data1$logo_height, data1$logprice)
fit = lm(logprice ~ logo_height, data1)
abline(fit, col = "red")
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
sum(data1$logo_height == 0)
## [1] 502
temp = data1[data1$logo_height > 0, ]
plot(temp$logo_height, temp$logprice)
fit = lm(logprice ~ logo_height, temp)
abline(fit, col = "red")
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

It show the same pattern. Out of curiosity, we create a variable named, logo_ratio, which is the ration of logo height and width.

mask = data1$logo_height > 0
data1["logo_ratio"] = data1["logo_height"] / data1["logo_width"]
plot(data1$logo_ratio[mask], data1$logprice[mask])
fit = lm(logprice ~ logo_ratio, data1[mask,])
abline(fit, col = "red")
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

But it turns out seems this variable does not help much. Considering the data sanity issue in those columns, we did not put them to our candidate predictor list.

Then we checked the country_code, which is a categorical variable.

summary(data1$country_code)
##  ANT  ARE  ARG  AUS  AUT  BEL  BMU  BRA  CAN  CHE  CHL  CHN  CYM  CZE  DEU  DNK 
##    1    1    2   23    3    3    1    4   53    8    1   23    1    1   23    5 
##  ESP  FIN  FRA  GBR  GHA  IDN  IND  IRL  ISR  ITA  JOR  JPN  KEN  KOR  LBN  LUX 
##    5    4   20  100    1    1   10    8   34    6    1   11    1    7    1    2 
##  MAR  MDA  MEX  MMR  MYS  NCL  NLD  NOR  NZL  PAK  PRT  RUS  SAU  SGP  SWE  TUN 
##    1    1    3    1    1    1   13    4    1    2    1    4    1    4   14    1 
<<<<<<< HEAD
##  TUR  TWN  USA  VGB  ZAF NA's 
##    1    1 1532    1    5  514
plot(data1$country_code, data1$logprice)

======= ## TUR TWN USA VGB ZAF NA's ## 1 1 1532 1 5 514
plot(data1$country_code, data1$logprice)

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

The records concentrated in several major levels of country_code. To be more specific, most records is in USA. As for the boxplot between country_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

The same observation found in city.

summary(data1$city)
##       San Francisco            New York       Mountain View         Santa Clara 
##                  99                  84                  44                  44 
##              London           Sunnyvale           Palo Alto            San Jose 
##                  42                  36                  31                  29 
##             Seattle           Cambridge           San Diego              Austin 
##                  29                  28                  27                  24 
##             Chicago         Los Angeles        Redwood City              Boston 
##                  22                  21                  20                  19 
##              Irvine             Waltham           San Mateo             Atlanta 
##                  17                  16                  15                  14 
##            Bellevue              Dublin             Houston           Cupertino 
##                  13                  12                  12                  11 
##              Dallas             Fremont             Boulder              Denver 
##                  11                  11                  10                  10 
##         Marlborough          Menlo Park            Tel Aviv               Paris 
##                   9                   9                   9                   8 
##             Raleigh        Santa Monica               Tokyo             Bedford 
##                   8                   8                   8                   7 
##              Durham          Pittsburgh              Reston South San Francisco 
##                   7                   7                   7                   7 
##           Amsterdam           Arlington            Brisbane            Carlsbad 
##                   6                   6                   6                   6 
##           Charlotte           Englewood             Hayward             Herndon 
##                   6                   6                   6                   6 
##            Milpitas        Philadelphia             Phoenix            Portland 
##                   6                   6                   6                   6 
##               Seoul             Toronto              Vienna         Aliso Viejo 
##                   6                   6                   6                   5 
##           Baltimore            Columbia              Lowell         Morrisville 
##                   5                   5                   5                   5 
##       Overland Park          Pleasanton          Scottsdale           Stockholm 
##                   5                   5                   5                   5 
##          Alpharetta       American Fork             Beijing             Bothell 
##                   4                   4                   4                   4 
##             Bristol            Brooklyn            Campbell        Corte Madera 
##                   4                   4                   4                   4 
##          El Segundo          Emeryville        Gaithersburg        Indianapolis 
##                   4                   4                   4                   4 
##              Irving             Malvern              McLean           Melbourne 
##                   4                   4                   4                   4 
##           Milwaukee         Minneapolis              Moscow       New York City 
##                   4                   4                   4                   4 
##              Newton              Oxford               Plano           Princeton 
##                   4                   4                   4                   4 
##          Richardson         San Antonio           San Bruno          San Carlos 
##                   4                   4                   4                   4 
##              Sydney            Tel-Aviv         Toronto, ON    Westlake Village 
##                   4                   4                   4                   4 
<<<<<<< HEAD
##              Woburn           Ann Arbor             (Other)                NA's 
##                   4                   3                 777                 580
plot(data1$city, data1$logprice)

The records concentrated in several major levels of city. As for the boxplot between city_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

For investiment_rounds, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$investment_rounds, data1$logprice)

For investiment_companies, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$invested_companies, data1$logprice)

For milestones, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$milestones, data1$logprice)

======= ## Woburn Ann Arbor (Other) NA's ## 4 3 777 580
plot(data1$city, data1$logprice)

The records concentrated in several major levels of city. As for the boxplot between city_code and the response, we can find significant mean within group difference between some groups, which shows this variable has some prediction power. However, at the same time, the varaible within group is also large, which means only relying on this variable cannot lead to a accurate result. So we decided to create some dummy varaibles for the major levels of this variable and put them in our candidate predictor list.

For investiment_rounds, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$investment_rounds, data1$logprice)

For investiment_companies, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$invested_companies, data1$logprice)

For milestones, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$milestones, data1$logprice)

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

For relationships, we found it showed a slightly positive correlation with log price. We added it to our candidate predictor list.

plot(data1$relationships, data1$logprice)
fit = lm(logprice ~ relationships, data1)
abline(fit, col = "red")
<<<<<<< HEAD

### Correlation Analysis Then we checked the correlation matrix of the numerical variables.

data2 = data1[,c("name_length", "investment_rounds", "invested_companies", "milestones", "relationships")]
cor(data2)
##                    name_length investment_rounds invested_companies milestones
## name_length         1.00000000       -0.04786664        -0.04856472 -0.2870902
## investment_rounds  -0.04786664        1.00000000         0.98970339  0.1519085
## invested_companies -0.04856472        0.98970339         1.00000000  0.1576869
## milestones         -0.28709018        0.15190852         0.15768691  1.0000000
## relationships      -0.14405538        0.39209720         0.40107629  0.4476997
##                    relationships
## name_length           -0.1440554
=======

#### Correlation Analysis Then we checked the correlation matrix of the numerical variables.

data2 = data1[,c("name_length", "investment_rounds", "invested_companies", "milestones", "relationships")]
cor(data2)
##                    name_length investment_rounds invested_companies milestones
## name_length         1.00000000       -0.04775877        -0.04843915 -0.2870815
## investment_rounds  -0.04775877        1.00000000         0.98970339  0.1519085
## invested_companies -0.04843915        0.98970339         1.00000000  0.1576869
## milestones         -0.28708145        0.15190852         0.15768691  1.0000000
## relationships      -0.14409414        0.39209720         0.40107629  0.4476997
##                    relationships
## name_length           -0.1440941
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## investment_rounds      0.3920972
## invested_companies     0.4010763
## milestones             0.4476997
## relationships          1.0000000

It turns out the investment_rounds and invested_companies shows strongly collinearity. We should only use one of them in the model to avoid the collinearity issue.

<<<<<<< HEAD

Methods

Data Exploration (Yu)

======= >>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

Data Adjustments (Moha)

Based on the single predictor analysis above, it is now possible to further simplify the dataset that will be used to determine whether the aquisition_price of startups can be correlated with startup characteristics and their performance.

Some data adjustments include:

  • Collapsing of the number of levels for factor variables with a high number levels: category_code, country_code, city, state_code. This will ommit potential dummy variables when dealing with factor variables with levels that provide little significance. The fewer levels and variables also adds benefit of simplifying the model.
  • Removal of columns that are not useful in the statitical analysis: company_id,normalized_name, status, logo_width, logo_height. These are pure descriptions of the startups, and while these could have been a source for additional continuous (e.g. word counts) or discreet variables (mention of certain trendy keywords), we decided to leave out for now to narrow down the analysis on the other variables.
  • Division of the full data into 2 datasets: one for training (startups_tr), with 80% of the observations, and the rest will be included in a test dataset (startups_te), useful for model cross-validation. The split is done randomly.
  • Removal of outlier case of acquisition for 2.6T. For reference, Amazon’s Market Cap is 1.6T. We believe this was an input error.
#Loading of the data
library(readr)
startups = read_csv("./startups.csv")
attr(startups, "spec") = NULL
n = nrow(startups)

#Function that reduces the number of levels of a factor, by replacing some levels with the new.level based on the number of records a level has. 
#If threshold < 1, then it keeps the largest levels by counts up to when a the threshold percentile is reached
#If threshold >= 1, then it keeps the levels that have counts greater or equal to the threshold.
shorten_levels = function(col, threshold, new.level){
  new_col = as.character(col) 
  levels_to_collapse = 0
  level_counts = summary(col)
  counts = aggregate(new_col, by = list(col = new_col), FUN = length)

  if(threshold < 1){
    counts = counts[order(counts$x, decreasing = TRUE), ]
    counts$cumpct = cumsum(counts$x)/sum(counts$x)
    levels_to_collapse = counts[counts$cumpct >= threshold, "col"]
  }else{
    #levels_to_collapse = names(which(level_counts > threshold))
    levels_to_collapse = counts[counts$x < threshold, "col"]
  }
  
  replace = new_col %in% levels_to_collapse
  new_col[replace] = new.level
  as.factor(new_col)
}

#Modified function from https://www.mjdenny.com/Text_Processing_In_R.html to "normalize text"
clean_string = function(string){
    # Lowercase
    temp =  tolower(string)
    # Remove everything that is not a number or letter (
    temp = stringr::str_replace_all(temp,"[^a-zA-Z\\s]", " ")
    # Shrink down to just one white space
    temp = stringr::str_replace_all(temp,"[\\s]+", " ")
    temp
}

#Collapse the number of levels for all factor predictors.
startups$category_code = shorten_levels(startups$category_code, .90, "other")
startups$country_code = shorten_levels(startups$country_code, 20, "other")
startups$city = shorten_levels(clean_string(as.character(startups$city)),20, "other")
startups$state_code = shorten_levels(as.character(startups$state_code),.90, "other")
summary(startups$state_code)
##      CA      CO      FL      IL      MA      MD      NJ      NY   other      PA 
##     581      34      37      42     118      32      46     110     273      41 
##      TX unknown      VA      WA 
##      77     980      43      59
#Remove unnecessary columns (particularly the descriptions and single-level factors and variables that did not seem relevant for the aquisition response variable)
to_remove = c("company_id","normalized_name", "status", "logo_width", "logo_height")
startups_desc = startups[, to_remove]
startups = startups[, -which(colnames(startups) %in% to_remove)]

#Remove outlier in acquisition price
startups = startups[startups$aquisition_price_amount<1e12,]

#Split of data for cross-validation: 80% for training, 20$ for testing.
set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]

The final dataset for analysis includes the following predictors.

str(startups)
<<<<<<< HEAD
## Classes 'tbl_df', 'tbl' and 'data.frame':    2472 obs. of  16 variables:
=======
## tibble [2,472 x 16] (S3: tbl_df/tbl/data.frame)
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
##  $ category_code          : Factor w/ 13 levels "advertising",..: 4 13 4 5 8 5 3 13 1 8 ...
##  $ country_code           : Factor w/ 10 levels "AUS","CAN","CHN",..: 10 10 10 10 10 10 10 10 7 8 ...
##  $ state_code             : Factor w/ 14 levels "CA","CO","FL",..: 1 1 1 11 9 1 1 1 12 12 ...
##  $ city                   : Factor w/ 17 levels "austin","cambridge",..: 8 6 12 8 8 8 12 6 8 8 ...
<<<<<<< HEAD
##  $ investment_rounds      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ invested_companies     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ funding_rounds         : num  0 1 1 1 3 1 5 2 0 0 ...
##  $ funding_total_usd      : num  0 5000000 3000000 274999 15286415 ...
##  $ milestones             : num  0 3 2 0 1 3 3 3 0 2 ...
##  $ relationships          : num  6 14 6 4 2 7 38 2 2 59 ...
##  $ aquisition_price_amount: num  2.00e+07 4.75e+07 1.50e+07 2.62e+08 3.65e+07 ...
##  $ top25pct               : num  0 64 0 100 0 0 55 100 0 20 ...
##  $ top50pct               : num  0 18 0 0 0 0 0 0 0 7 ...
##  $ istop25                : num  0 1 0 1 0 0 1 1 0 1 ...
##  $ istop26_50             : num  0 1 0 0 0 0 0 0 0 1 ...
##  $ name_length            : num  7 10 7 8 9 9 6 8 9 8 ...
======= ## $ investment_rounds : num [1:2472] 0 0 0 0 0 0 0 0 0 0 ... ## $ invested_companies : num [1:2472] 0 0 0 0 0 0 0 0 0 0 ... ## $ funding_rounds : num [1:2472] 0 1 1 1 3 1 5 2 0 0 ... ## $ funding_total_usd : num [1:2472] 0 5000000 3000000 274999 15286415 ... ## $ milestones : num [1:2472] 0 3 2 0 1 3 3 3 0 2 ... ## $ relationships : num [1:2472] 6 14 6 4 2 7 38 2 2 59 ... ## $ aquisition_price_amount: num [1:2472] 2.00e+07 4.75e+07 1.50e+07 2.62e+08 3.65e+07 ... ## $ top25pct : num [1:2472] 0 64 0 100 0 0 55 100 0 20 ... ## $ top50pct : num [1:2472] 0 18 0 0 0 0 0 0 0 7 ... ## $ istop25 : num [1:2472] 0 1 0 1 0 0 1 1 0 1 ... ## $ istop26_50 : num [1:2472] 0 1 0 0 0 0 0 0 0 1 ... ## $ name_length : num [1:2472] 7 10 7 8 9 9 6 8 9 8 ...
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

Diagnostics and Performance Function (Diego/Moha)

In order to compare models, we use a single function that runs tests on both:

  • Model assumptions: Residual vs fitted and Q-Q plots, bptest and wilk-shapiro test.
  • Model performance: Adjusted r-squared, AIC, BIC, leave one out cross-validation, and regression p-value.

These diagnostics functions are further used in another set of utility functions that provide a side-by-side view of the model performance across the model assumptions, performance, model parameters and their caracteristics.

library(lmtest)
library(MASS)
library(faraway)

get_outliers = function(model){
  stdresid = rstandard(model)
  outliers = abs(stdresid) > 2
  stats = c(
    num_outliers = sum(outliers),
    pct_outliers = mean(outliers)
  )
  
  list(stats = stats, outliers = outliers)
}

get_influentials = function(model){
  cooks = cooks.distance(model)
  influential = cooks > (4 / length(cooks))
  
  stats = c(
    num_influentials = sum(influential),
    pct_influentials = mean(influential)
  )

  list(stats = stats, influentials = influential)
}

get_rmse = function(model, predict_function = NULL, test_actuals = NULL){
  predictions = 0
  if(!is.null(predict_function))
    predict_function(model)
  rsme_val = 0
    
  if(!is.null(test_actuals)) 
    rsme_val = sqrt(mean((test_actuals - predictions) ^ 2)) 
  else
    rsme_val = sqrt(mean(resid(model) ^ 2))
      
  rsme_val
}

#get_rmse(model_transformation_boxcox_selected, predict_function = boxcox_reverse_function, test_actuals = startups_te$aquisition_price_amount)


# Diagnostic function that performs multiple hypothess tests for the model significance, whether the linear model assumptions are valid, and other model performance metrics 
diagnostics_performance = function(model, pcol = "grey",  lcol = "dodgerblue", plotit = TRUE,  testit = TRUE, response_transform = NULL, train_actuals = NULL, as_data_frame=TRUE) {
  #Plot functionality
  if (plotit) {
    par(mfrow = c(1, 2))
    
    #Fitted vs residuals plot
    plot(resid(model) ~ fitted(model), 
         col = pcol, pch = 20, 
         main = "Fitted vs Residuals Plot", xlab = "Fitted", ylab = "Residuals")
    abline(h = 0, col = lcol, lwd = 2)
    
    #Q-Q plot
    qqnorm(resid(model), col = pcol, main = "Normal Q-Q Plot")
    qqline(resid(model),  lty = 2, lwd = 2, col = lcol)
  }
  
  #Test functionality
  if (testit) {
    resids = resid(model)
    if(!is.null(response_transform) && !is.null(train_actuals)){
      fits = response_transform(fitted(model))
      resids = train_actuals-fits
    }
    shapiro_test_pval = shapiro.test(resids)$p.value 
    bp_test_pval = unname(bptest(model)$p.value)
    rmse_loocv = sqrt(mean((resids / (1 - hatvalues(model))) ^ 2))
    r_squared = summary(model)$r.squared
    adj_r_squared = summary(model)$adj.r.squared
    aic = extractAIC(model)[2]
    bic =  extractAIC(model, k = log(n))[2]
    outliers = unname(get_outliers(model)$stats)
    influentials = unname(get_influentials(model)$stats)
    model_pval = unname(pf(summary(model)$fstatistic[1],summary(model)$fstatistic[2],summary(model)$fstatistic[3], lower.tail=FALSE))
    rmse_train = sqrt(mean(resids ^ 2))
    
    
    #Output from test
    res = list(
      "coefs" =  nrow(summary(model)$coefficients),
      "model.pval" = model_pval,
      "r2" =  r_squared,
      "rmse.train" = rmse_train,
      "rmse.test" = NA,
      "bptest.pval" =  bp_test_pval,
      "shapiro.pval" =  shapiro_test_pval,
      "AIC" =  aic,
      "BIC" =  bic,
      "adjR2" =  adj_r_squared,
      "rsme_looocv" =  rmse_loocv,
      "outliers.num" = outliers[1],
      "outliers.pct" = outliers[2],
      "influential.num" = influentials[1],
      "influential.pct" = influentials[2])
    
    if(as_data_frame)
      res = data.frame(unlist(res))
    res
  }
}


#Shows the variable infation factors and their variables for the parameters with high variability due to colinearity.
get_vifs = function(model){
  vifs = faraway::vif(model)
  vifs = data.frame(coeficient =names(vifs[vifs>5]), vif = unname(vifs[vifs>5]))
  vifs = vifs[order(vifs$vif, decreasing=TRUE),]
  vifs
}

#Compares  two or more models side-by-side in a single dataframe, all model assumptions and prformance.
compare_model_performance = function(models, predict_transforms = NULL, test_data, test_actuals_col, train_actuals = NULL, as_data_frame = FALSE){
  dummy = unlist(diagnostics_performance(models[[1]], plotit = F))
  model_comp = matrix(0, nrow = length(dummy), ncol = length(models))
  col_names = rep(0, length(models))
  
  #View(model_comp)

  for(m in 1:length(models)){
    model = models[[m]]
    model_comp[,m] = as.numeric(unlist(diagnostics_performance(model, plotit = FALSE, response_transform = predict_transforms[[m]], train_actuals = train_actuals, as_data_frame = FALSE)))

    transform_func = function(x){x}
    transform_func = ifelse(is.null(predict_transforms), transform_func, predict_transforms[[m]])
    predict_func = custom_predict(model, test_data = test_data, fitted_transform = transform_func)
    
    actuals = as.numeric(unlist(test_data[,test_actuals_col]))
    rmse_test = get_rmse(model, predict_function = predict_func, test_actuals = actuals)
    print(rownames(model_comp))
    print(model_comp)
    model_comp[names(dummy) == "rmse.test", m] = rmse_test #
    
  }
  #print(col_names)
  rownames(model_comp) = names(dummy)
  model_comp
}

<<<<<<< HEAD
=======
#compare_without_influential(model_transformation_boxcox_selected, train_data = startups_tr, test_data = startups_te, test_actuals_col = "aquisition_price_amount", response_transform = boxcox_reverse_function(lambda = 0.12))


>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
#Helper function to allow custom response transforamtions, relevant to compare models.
custom_predict = function(model, test_data, fitted_transform = NULL){
  fitted.transform = ifelse(is.null(fitted_transform), function(x){x}, fitted_transform)
  function(model){
    fitted_transform(predict(model, test_data))
  }
}

#View model diagnostics with and without influential points.
compare_without_influential = function(model, train_data, test_data, test_actuals_col, response_transform = NULL){
  formula = as.character(model$call)[2]
  influentials = get_influentials(model)$influentials
  new_model = lm(as.formula(formula), data = train_data, subset = !influentials)
  
  compare_coef = compare_model_performance(list(model, new_model), predict_transforms = list(response_transform, response_transform), test_data = test_data, test_actuals_col = test_actuals_col,  train_actuals = as.numeric(unlist(train_data[,test_actuals_col])))
  colnames(compare_coef) = c("W/ Influential", "W/o Influential")
  compare_coef
}


#List all coefficients used by all models, along with a view of coefficient values
compare_parameters = function(models,  variables, model_names = 0){
  all_coefs = data.frame(coefficients = rownames(summary(models[[1]])$coefficients))
  col_names = c("coefficients")
  
  for(m in 1:length(models)){
    model = models[[m]]
    model_name = ifelse(model_names == 0, m, model_names[m])
    
    new_items = data.frame(coefficients = rownames(summary(model)$coefficients))
    for(v in 1:length(variables)){
      new_items = cbind(new_items,
                        variable = summary(model)$coefficients[,variables[v]])
      
      colnames(new_items)[v+1] =  paste(model_name,"_",v)
      
    }
    
    all_coefs = merge(all_coefs, new_items, by = "coefficients", all = TRUE)
    #colnames(all_coefs) = col_names
    
  }
  #colnames(all_coefs) = col_names
  all_coefs
}

Model building

  • Additive Models

All variables

The first step of the exploration is to look at a simple additive model that contains all variables.

#Training the model
additive_full = lm(aquisition_price_amount ~ . , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(additive_full), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 6.200000e+01
model.pval 0.000000e+00
r2 1.600000e-01
rmse.train 1.685398e+09
rmse.test NA
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.412779e+04
BIC 8.448821e+04
adjR2 1.300000e-01
rsme_looocv 1.767236e+09
outliers.num 3.900000e+01
outliers.pct 2.000000e-02
influential.num 5.400000e+01
influential.pct 3.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full), digits = 2)
coeficient vif
14 invested_companies 49.28
13 investment_rounds 49.22
7 country_codeUSA 46.53
10 cityother 45.72
12 cityunknown 44.11
6 country_codeunknown 28.09
8 state_codeunknown 25.73
9 citynew york 9.44
11 citysan francisco 8.59
5 country_codeother 7.07
3 category_codeunknown 6.84
4 country_codeGBR 5.66
2 category_codesoftware 5.36
1 category_codeother 5.10

We can see that model assumptions are clearly violated. We also see that there are variables with a potentially large variance inflation factor. For example, it would make sense to only use investment_rounds or invested_companies, or country or state, because there can be collinearity as discussed in the data exploration section. 1.97% and 2.73% of points are considered to be outliers and to have influence respectevely. Although we can aim to reduce these, in this first exploration of different additive models it does not seem likely this values will change much.

#Direct correlation between investment_rounds and invested_companies is close to one
cor(startups_tr$investment_rounds,startups_tr$invested_companies)
## [1] 0.9889835
#The partial correlation coefficient is also very close to zero.
#The variation of invested_companies that is unexplained by remaining predictors shows little correlation
#with the variation of aquisition_price_amount that is not explained remaining predictors.
#Adding invested companies is of little use to the model.
model_test_invested_companies_1 = lm(invested_companies ~ .-aquisition_price_amount , data = startups_tr)
model_test_invested_companies_2 = lm(aquisition_price_amount ~ .-invested_companies , data = startups_tr)
cor(resid(model_test_invested_companies_1), resid(model_test_invested_companies_2))
## [1] 0.1080731

We remove the invested_companies and state_code from the predictors and train a new model.

Improving collinearity

#Training the model
additive_full_updated = lm(aquisition_price_amount ~ .-investment_rounds -state_code , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(additive_full_updated), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 4.800000e+01
model.pval 0.000000e+00
r2 1.300000e-01
rmse.train 1.709368e+09
rmse.test NA
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.415563e+04
BIC 8.443466e+04
adjR2 1.100000e-01
rsme_looocv 1.764967e+09
outliers.num 3.500000e+01
outliers.pct 2.000000e-02
influential.num 4.400000e+01
influential.pct 2.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(additive_full_updated), digits = 2)
coeficient vif
9 cityother 32.01
11 cityunknown 30.18
7 country_codeUSA 25.38
6 country_codeunknown 24.98
5 country_codeother 7.07
3 category_codeunknown 6.81
10 citysan francisco 5.95
4 country_codeGBR 5.65
2 category_codesoftware 5.35
1 category_codeother 5.08
8 citynew york 5.03

In the plots, we can see that normality and constant variance are still suspect. We also see confirm that normality suspect due to the low shapiro-wilk test p-value, and that the constant variance is suspect due to low bp test p-value. The R squared and adjusted R squared remain somewhat similar, as well as the RMSE leave one out cross validation. But we now have less variables with a potentially large collinearity. Outlier and influential points remain somewhat similar.

Backwards AIC

Now we will perform a backwards AIC search to optimize for AIC and further decrease collinearity issues.

#Finding the model
model_additive_selected_AIC = step(additive_full_updated, trace = 0)

#Diagnostics and performance of the model
knitr::kable(diagnostics_performance(model_additive_selected_AIC), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 1.800000e+01
model.pval 0.000000e+00
r2 1.200000e-01
rmse.train 1.719315e+09
rmse.test NA
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.411857e+04
BIC 8.422321e+04
adjR2 1.200000e-01
rsme_looocv 1.757006e+09
outliers.num 3.900000e+01
outliers.pct 2.000000e-02
influential.num 5.000000e+01
influential.pct 3.000000e-02
#Coefficients with high Variance inflation factors
knitr::kable(get_vifs(model_additive_selected_AIC), digits = 2)
coeficient vif
2 category_codeunknown 5.96
1 category_codesoftware 5.16

The found model still violates normality and constant variance assumptions, as would normally be expected since this is derived from a model that also violated them. However, we are able to see less predictors with a variance inflation factor higher than 5. Outlier and influential points remain similar. Now we proceed to compare both models with an anova test and decide which is the best additive model.

Selecting the best additive model

anova(model_additive_selected_AIC, additive_full_updated)
## Analysis of Variance Table
## 
## Model 1: aquisition_price_amount ~ category_code + invested_companies + 
##     funding_rounds + relationships + top50pct + istop26_50
## Model 2: aquisition_price_amount ~ (category_code + country_code + state_code + 
##     city + investment_rounds + invested_companies + funding_rounds + 
##     funding_total_usd + milestones + relationships + top25pct + 
##     top50pct + istop25 + istop26_50 + name_length) - investment_rounds - 
##     state_code
##   Res.Df        RSS Df  Sum of Sq      F Pr(>F)
## 1   1959 5.8441e+21                            
## 2   1929 5.7767e+21 30 6.7425e+19 0.7505 0.8334

We can see that the larger model doesn’t offer a statistically significant difference in explaining the relationship of predictors and response. For that reason, from the additive models, we prefer the model found through the backwards BIC method model_additive_selected_AIC. This model still violates normality and constant variance assumptions, and these will not likely be corrected or at least further improved until we look at more complex interaction models or we apply transformations to predictor or response variables (done below).

All in all, we won’t be focusing on this purely additive model model_additive_selected_AIC because of the very low r squared and potential low power to make predictions.

  • Interaction Models

Two-way interactions

We started the exploration by training a 2-way interaction model, excluding investment_rounds and state_code from the variables, due to collinearity.

We also tried to fit an overall 3-way model in a couple of ways. First, by considering all terms besides investment_rounds and state_code, but due to computational limitations we decided to take a different approach (after 3 hours the model was not complete training yet). We also tried to fit an overall 3-way model, excluding city from the interaction terms (and adding it as a separate term on its own) to try to improve machine performance–altough this model offered a good initial R squared (~0.9) it was hard to interpret and still taxing on machine resources.

#Training the model
interaction_model = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(interaction_model), digits = 2))
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 6.510000e+02
model.pval 0.000000e+00
r2 7.390411e-01
rmse.train 9.380137e+08
bptest.pval 7.170000e-05
shapiro.pval 0.000000e+00
AIC 8.298877e+04
BIC 8.677316e+04
adjR2 6.111201e-01
rsme_looocv Inf

Although this model mathematically has a good R squared and adjusted R squared, the normality and constant variance assumptions are still violated. Furthermore, because there are 906 predictors, this is not an easy to understand model. We will find a simplified model doing a backwards AIC search.

Forward AIC

Because the previous model has 906 predictors, we use a forward search to optimize for computational resources. We tried to follow use a BIC approach so that we optimize for reducing the number of predictors, but the resulting R squared was quite low (0.10), so we decided to continue with the AIC method.

#Finding the model
model_interaction_start = lm(aquisition_price_amount ~ 1, data = startups_tr)
model_interaction_selected = step(
  model_interaction_start,
  scope = ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length,
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_interaction_selected), digits = 2))
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 8.700000e+01
model.pval 0.000000e+00
r2 5.475401e-01
rmse.train 1.235132e+09
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 8.294879e+04
BIC 8.345454e+04
adjR2 5.269520e-01
rsme_looocv Inf

Although this model optimizes for AIC (it has a lower AIC than the first two-way interaction model), this model performs worse in other areas. In particular the both the R squared and adjusted R squared decreased (0.55 and 0.53 respectively). The normality and constant variance assumptions continue to be violated.

Selecting the best interaction model

Although we prefer the first two-way interaction model due to its better performance, we run an anova test to check the significance of the additional parameters included on the first two-way model vs the model found via forward AIC.

anova(model_interaction_selected, interaction_model)
## Analysis of Variance Table
## 
## Model 1: aquisition_price_amount ~ relationships + category_code + funding_rounds + 
##     istop26_50 + top50pct + invested_companies + funding_total_usd + 
##     milestones + relationships:category_code + category_code:funding_rounds + 
##     funding_rounds:istop26_50 + relationships:top50pct + category_code:invested_companies + 
##     category_code:istop26_50 + category_code:top50pct + relationships:invested_companies + 
##     funding_rounds:invested_companies + top50pct:invested_companies + 
##     relationships:istop26_50 + top50pct:milestones
## Model 2: aquisition_price_amount ~ ((category_code + country_code + state_code + 
##     city + investment_rounds + invested_companies + funding_rounds + 
##     funding_total_usd + milestones + relationships + top25pct + 
##     top50pct + istop25 + istop26_50 + name_length) - investment_rounds - 
##     state_code)^2
##   Res.Df        RSS  Df  Sum of Sq      F    Pr(>F)    
## 1   1890 3.0160e+21                                    
## 2   1326 1.7395e+21 564 1.2765e+21 1.7253 1.137e-15 ***
## ---
<<<<<<< HEAD
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
======= ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1

The anova test confirms that the initial model does include predictors that are significant. The preferred interaction model is interaction_model.

  • Transformation Models

Boxcox method

We start with using the same predictors used in the first 2-way interaction model. We use a \(\lambda\) value of 0.12. Also, we apply a logarithmic transformation in the predictors funding_total_usd and relationships (+1 is added to these predictors because there are records with the value 0 which would lead to issues when applying the log transformation), as this showed an improvement in decreasing adjusted r squared and AIC.

#Initial model, before transformation
model_transformation_boxcox_initial = lm(aquisition_price_amount ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Boxcox plot
boxcox(model_transformation_boxcox_initial, plotit = TRUE, lambda = seq(0.1, 0.14, by = 0.01))
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
#Training the model
model_transformation_boxcox = lm((((aquisition_price_amount ^ 0.12) - 1) / 0.12) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Define the transformation functions to revert back to the units of aquisition_price
boxcox_reverse_function = function(lambda = 0.12){
  function(val){
    (val*lambda + 1)^(1/lambda)
  }
}


#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 6.530000e+02
model.pval 0.000000e+00
r2 6.100000e-01
rmse.train 1.106672e+09
bptest.pval 1.000000e+00
shapiro.pval 0.000000e+00
AIC 1.165783e+04
BIC 1.545384e+04
adjR2 4.100000e-01

Although in this model we are seeing lower r squared and adjusted r squared values than we obtained with the two-way interaction model, now the constant variance assumption has improved (we can see this both in the fitted vs residuals plot, and in the result of the bp test, which provides a p-value of 0.9997102). Normality is still suspect.

Forward AIC, using boxcox model as the scope

We use the previously trained model as the full scope for a forward AIC search.

#Finding the model
model_transformation_boxcox_start = lm(((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ 1, data = startups_tr)
model_transformation_boxcox_selected = step(
  model_transformation_boxcox_start,
  scope =
    ((aquisition_price_amount ^ 0.12 - 1)/ 0.12) ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
    log(funding_total_usd + 1) + log(relationships + 1),
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_boxcox_selected, response_transform = boxcox_reverse_function(lambda = 0.12), train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 9.600000e+01
model.pval 0.000000e+00
r2 6.500000e-01
rmse.train 1.376203e+10
bptest.pval 0.000000e+00
shapiro.pval 0.000000e+00
AIC 1.029215e+04
BIC 1.085022e+04
adjR2 6.300000e-01
rsme_looocv 4.252724e+10
outliers.num 1.040000e+02
outliers.pct 5.000000e-02
influential.num 8.800000e+01
influential.pct 4.000000e-02
## Evaluate whether removing influential observations help on the model assumption and fit.
knitr::kable(compare_without_influential(model_transformation_boxcox_selected, train_data = startups_tr, test_data = startups_te, test_actuals_col = "aquisition_price_amount", response_transform = boxcox_reverse_function(lambda = 0.12)), digits = 2)
## NULL
##               [,1] [,2]
##  [1,] 9.600000e+01    0
##  [2,] 0.000000e+00    0
##  [3,] 6.523166e-01    0
##  [4,] 1.376203e+10    0
##  [5,]           NA    0
##  [6,] 8.525834e-60    0
##  [7,] 5.991583e-72    0
##  [8,] 1.029215e+04    0
##  [9,] 1.085022e+04    0
## [10,] 6.347568e-01    0
## [11,] 4.252724e+10    0
## [12,] 1.040000e+02    0
## [13,] 5.260496e-02    0
## [14,] 8.800000e+01    0
## [15,] 4.451189e-02    0
## NULL
##               [,1]         [,2]
##  [1,] 9.600000e+01 9.600000e+01
##  [2,] 0.000000e+00 0.000000e+00
##  [3,] 6.523166e-01 7.023774e-01
##  [4,] 1.376203e+10 5.323183e+09
##  [5,]           NA           NA
##  [6,] 8.525834e-60 5.490834e-69
##  [7,] 5.991583e-72 6.868054e-70
##  [8,] 1.029215e+04 9.260603e+03
##  [9,] 1.085022e+04 9.818669e+03
## [10,] 6.347568e-01 6.866082e-01
## [11,] 4.252724e+10 1.178727e+10
## [12,] 1.040000e+02 1.030000e+02
## [13,] 5.260496e-02 5.452620e-02
## [14,] 8.800000e+01 6.700000e+01
## [15,] 4.451189e-02 3.546850e-02
W/ Influential W/o Influential
unlist.res.1 9.600000e+01 9.600000e+01
unlist.res.2 0.000000e+00 0.000000e+00
unlist.res.3 6.500000e-01 7.000000e-01
unlist.res.4 1.376203e+10 5.323183e+09
unlist.res.5 NA NA
unlist.res.6 0.000000e+00 0.000000e+00
unlist.res.7 0.000000e+00 0.000000e+00
unlist.res.8 1.029215e+04 9.260600e+03
unlist.res.9 1.085022e+04 9.818670e+03
unlist.res.10 6.300000e-01 6.900000e-01
unlist.res.11 4.252724e+10 1.178727e+10
unlist.res.12 1.040000e+02 1.030000e+02
unlist.res.13 5.000000e-02 5.000000e-02
unlist.res.14 8.800000e+01 6.700000e+01
unlist.res.15 4.000000e-02 4.000000e-02

The model shows a significant improvement in performance (adjusted r squared, AIC, BIC) but the constant variance assumption is violated again. For this reason, we still prefer the original boxcox model model_transformation_boxcox.

Logarithmic transformation in response

We also explore using a logarithmic transformation in the response variable, using the same predictors originally used for the first model fitted with the boxcox method above.

#Training the model
model_transformation_log = lm(log(aquisition_price_amount) ~ (. -investment_rounds -state_code)^2 + log(funding_total_usd+1) +log(relationships+1) , data = startups_tr)

#Diagnostics and performance of the model
knitr::kable(na.omit(diagnostics_performance(model_transformation_log, response_transform = exp, train_actuals = startups_tr$aquisition_price_amount)), digits = 2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
unlist.res.
coefs 6.530000e+02
model.pval 0.000000e+00
r2 6.000000e-01
rmse.train 1.212812e+09
bptest.pval 1.000000e+00
shapiro.pval 0.000000e+00
AIC 3.930550e+03
BIC 7.726560e+03
adjR2 4.100000e-01
rsme_looocv Inf

In R squared and in predicted R squared this model performs marginally worse than the boxcox model, and the constant variance assumption doesn’t seem suspect from the bp test (although we can see it’s not fully met either from the fitted vs residuals plot). However, in AIC and in BIC, this model shows a more significant drop in performance (drop from 3274 to 3931 in AIC and from 7070 to 7727 in BIC). For that reason we prefer the boxcox model model_transformation_boxcox as a transformation-based model

Forward AIC, using lorarigthmic model as the scope

We also use the previously trained log model as the full scope for a forward AIC search.

#Finding the model
model_transformation_log_start = lm(log(aquisition_price_amount) ~ 1, data = startups_tr)
model_transformation_log_selected = step(
  model_transformation_log_start,
  scope =
    log(aquisition_price_amount) ~ category_code * funding_rounds + category_code * milestones +
    category_code * aquisition_price_amount + category_code * top50pct + category_code *
    istop26_50 + category_code * country_code + category_code * city + category_code *
    invested_companies + category_code * funding_total_usd + category_code *
    relationships + category_code * top25pct + category_code * istop25 + category_code *
    name_length + funding_rounds * milestones + funding_rounds * aquisition_price_amount +
    funding_rounds * top50pct + funding_rounds * istop26_50 + funding_rounds *
    country_code + funding_rounds * city + funding_rounds * invested_companies +
    funding_rounds * funding_total_usd + funding_rounds * relationships + funding_rounds *
    top25pct + funding_rounds * istop25 + funding_rounds * name_length + milestones *
    aquisition_price_amount + milestones * top50pct + milestones * istop26_50 +
    milestones * country_code + milestones * city + milestones * invested_companies +
    milestones * funding_total_usd + milestones * relationships + milestones *
    top25pct + milestones * istop25 + milestones * name_length + aquisition_price_amount *
    top50pct + aquisition_price_amount * istop26_50 + aquisition_price_amount *
    country_code + aquisition_price_amount * city + aquisition_price_amount *
    invested_companies + aquisition_price_amount * funding_total_usd + aquisition_price_amount *
    relationships + aquisition_price_amount * top25pct + aquisition_price_amount *
    istop25 + aquisition_price_amount * name_length + top50pct * istop26_50 +
    top50pct * country_code + top50pct * city + top50pct * invested_companies +
    top50pct * funding_total_usd + top50pct * relationships + top50pct * top25pct +
    top50pct * istop25 + top50pct * name_length + istop26_50 * country_code +
    istop26_50 * city + istop26_50 * invested_companies + istop26_50 * funding_total_usd +
    istop26_50 * relationships + istop26_50 * top25pct + istop26_50 * istop25 +
    istop26_50 * name_length + country_code * city + country_code * invested_companies +
    country_code * funding_total_usd + country_code * relationships + country_code *
    top25pct + country_code * istop25 + country_code * name_length + city *
    invested_companies + city * funding_total_usd + city * relationships + city *
    top25pct + city * istop25 + city * name_length + invested_companies * funding_total_usd +
    invested_companies * relationships + invested_companies * top25pct + invested_companies *
    istop25 + invested_companies * name_length + funding_total_usd * relationships +
    funding_total_usd * top25pct + funding_total_usd * istop25 + funding_total_usd *
    name_length + relationships * top25pct + relationships * istop25 + relationships *
    name_length + top25pct * istop25 + top25pct * name_length + istop25 * name_length +
    log(funding_total_usd + 1) + log(relationships + 1),
  direction = "forward",
  trace = 0
)

#Diagnostics and performance of the model
diagnostics_performance(model_transformation_log_selected)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
##                   unlist.res.
## coefs            8.900000e+01
## model.pval      4.480644e-295
## r2               5.843293e-01
## rmse.train       1.989951e+00
## rmse.test                  NA
## bptest.pval      5.368126e-07
## shapiro.pval     1.174242e-31
## AIC              2.898787e+03
## BIC              3.416161e+03
## adjR2            5.649548e-01
## rsme_looocv      2.368853e+00
## outliers.num     1.020000e+02
## outliers.pct     5.159332e-02
## influential.num  7.300000e+01
## influential.pct  3.692463e-02
#Number of predictors
length(coef(model_transformation_log_selected))
## [1] 89

Although this model shows improvements in AIC and BIC, other metrics do not look as good. And the constant variance assumption is more suspect now. For that reason the original logarithmic model model_transformation_log seems better, and overall, we still prefer the original boxcox model model_transformation_boxcox as a transformation-based model.

  • Summary

In summary, the two preferred models are: interaction_model (interaction), model_transformation_boxcox (transformation). We are omitting model_additive_selected_AIC (additive) because of its low r squared.

Results

#for interaction_model and model_additive_selected_AIC,no response transformation
<<<<<<< HEAD
simple_function = function(x) {
  x
}

compare_model_performance(
  models = list(interaction_model, model_additive_selected_AIC),
  predict_transforms = list(simple_function,  simple_function),
  test_data = startups_te,
  test_actuals_col = "aquisition_price_amount",
  train_actuals = startups_tr$aquisition_price_amount
)
======= simple_function = function(x){x} compare_model_performance( models = list(interaction_model, model_additive_selected_AIC), predict_transforms = list(simple_function, simple_function), test_data = startups_te, test_actuals_col = "aquisition_price_amount", train_actuals = startups_tr$aquisition_price_amount) >>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## Warning in predict.lm(model, test_data): prediction from a rank-deficient fit
## may be misleading
## NULL
##                [,1] [,2]
##  [1,]  6.510000e+02    0
##  [2,] 2.889544e-160    0
##  [3,]  7.390411e-01    0
##  [4,]  9.380137e+08    0
##  [5,]            NA    0
##  [6,]  7.169900e-05    0
##  [7,]  8.984570e-55    0
##  [8,]  8.298877e+04    0
##  [9,]  8.677316e+04    0
## [10,]  6.111201e-01    0
## [11,]           NaN    0
## [12,]            NA    0
## [13,]            NA    0
## [14,]            NA    0
## [15,]            NA    0
## NULL
##                [,1]         [,2]
##  [1,]  6.510000e+02 1.800000e+01
##  [2,] 2.889544e-160 3.474939e-45
##  [3,]  7.390411e-01 1.232725e-01
##  [4,]  9.380137e+08 1.719315e+09
##  [5,]            NA           NA
##  [6,]  7.169900e-05 3.098988e-19
##  [7,]  8.984570e-55 8.583976e-65
##  [8,]  8.298877e+04 8.411857e+04
##  [9,]  8.677316e+04 8.422321e+04
## [10,]  6.111201e-01 1.156643e-01
## [11,]           NaN 1.757006e+09
## [12,]            NA 3.900000e+01
## [13,]            NA 1.972686e-02
## [14,]            NA 5.000000e+01
## [15,]            NA 2.529084e-02
##                        [,1]         [,2]
## unlist.res.1   6.510000e+02 1.800000e+01
## unlist.res.2  2.889544e-160 3.474939e-45
## unlist.res.3   7.390411e-01 1.232725e-01
## unlist.res.4   9.380137e+08 1.719315e+09
## unlist.res.5             NA           NA
## unlist.res.6   7.169900e-05 3.098988e-19
## unlist.res.7   8.984570e-55 8.583976e-65
## unlist.res.8   8.298877e+04 8.411857e+04
## unlist.res.9   8.677316e+04 8.422321e+04
## unlist.res.10  6.111201e-01 1.156643e-01
## unlist.res.11           NaN 1.757006e+09
## unlist.res.12            NA 3.900000e+01
## unlist.res.13            NA 1.972686e-02
## unlist.res.14            NA 5.000000e+01
## unlist.res.15            NA 2.529084e-02

Diego

Below

Discussion

Appendix: Data Pre-cleaning

Data load

The startup data comes from Kaggle: [https://www.kaggle.com/justinas/startup-investments]

This dataset contains information about the startup ecosystem: organizations, individuals, company news, funding rounds, acquisitions, and IPOs, extracted originally from the Crunchabse Data website. In this project, only 6 of the 11 tables will be used and joined using unique IDs.

  • People Data Loading:
library(readr)

#People
people = read_csv("./people.csv")
colnames(people)[2] = "person_id"
attr(people, "spec") = NULL
colnames(people)
  • People Education Data Loading:
degrees = read_csv("./degrees.csv")
colnames(degrees)[2] = "person_id"
attr(degrees, "spec") = NULL
colnames(degrees)
  • Companies Data Loading:
#Company
companies = read_csv("./objects.csv")
attr(companies, "spec") = NULL
attr(companies, "problems") = NULL
colnames(companies)[1] = "company_id"
colnames(companies)
  • People/Company Relationship Data Loading:
comp_peep = read_csv("./relationships.csv")
colnames(comp_peep)[3] = "person_id"
colnames(comp_peep)[4] = "company_id"
colnames(comp_peep)
attr(comp_peep, "spec") = NULL
#str(comp_peep)
  • Company Aquisition Data Loading:
aquisitions = read_csv("./acquisitions.csv")
colnames(aquisitions)[4] = "company_id"
colnames(aquisitions)[6] = "aquisition_price_amount"
attr(aquisitions, "spec") = NULL
colnames(aquisitions)

#Filter out companies with aquisition amount not in USD (all other variable/predictors are in USD)
aquisitions = aquisitions[aquisitions$price_currency_code == "USD",c("company_id", "aquisition_price_amount")]

Data Preparation

After loading the individual tables, we need to join all into a single dataset that can be used for analysis.

Company Founders

The people, degrees and their relationship to the companies can all be joined into a sigle people dimension table. Additionally, we can extract potential relevant predictors for regression. For every person, generally an executive, we identify in each company people that attended to any of the top 25 or top 50 universities in the world (“top 50” excludes the top 25 universities). Also, we count the number of people in each company as well as the count of people by company that attended the top 25 and top 50 universities. Finally, to normalize these numbers, we also additional columns that represent percentage of people in each company that attended the top 25 and top 50 universities.

#Auxiliary functions to count rows for analysis of data
count_disticts = function(x){
  length(unique(x))
}
counts = function(x){
  length(x)
}

#Join people with their degrees information and with which company these people belong to.
full_peep = merge(people, degrees, by="person_id")
founders = merge(full_peep, comp_peep, by="person_id")

top1 = grep(".*University of Oxford.*|.*xford.*|.*California Institute of Technology.*|.*altech|.*University of Cambridge.*|.*ambridge.*|.*Stanford University.*|.*tanford.*|.*Massachusetts Institute of Technology.*|.*MIT.*|.*Princeton University.*|.*inceton.*|.*Harvard University.*| .*arvard#|.*Yale University.*|.*Yale.*|.*University of Chicago.*|.*Imperial College London.*|.*University of Pennsylvania.*|.*Johns Hopkins University.*|.*University of California Berkeley.*|.*Berkley.*|.*ETH Zurich.*|.*UCL.*|.*Columbia University.*|.*Columbia.*|.*University of California Los Angeles.*|.*University of Toronto.*|.*Cornell University.*|.*Cornell.*|.*Duke University.*|.*University of Michigan-Ann Arbor.*|.*Northwestern University.*|.*Tsinghua University.*|.*Peking University.*|.*National University of Singapore.*", ignore.case = TRUE, x = founders$institution)

top2 = grep(".*University of Washington.*|.*Carnegie Mellon University.*|.*Carnegie.*|.*London School of Economics and Political Science.*|.*LSE.*|.*New York University.*|.*University of Edinburgh.*|.*University of California San Diego.*|.*LMU Munich LMU.*|.*University of Melbourne.*|.*University of British Columbia.*|.*University of Hong Kong.*|.*King’s College London Kings College London.*|.*The University of Tokyo.*|.*École Polytechnique Fédérale de Lausanne.*|.*Lausanne.*|.*Georgia Institute of Technology.*|.*University of Texas at Austin.*|.*Karolinska Institute.*|.*McGill University.*|.*Technical University of Munich.*|.*Heidelberg University.*|.*KU Leuven.*|.*Paris Sciences et Lettres – PSL Research University Paris.*|.*PSL.*|.*The Hong Kong University of Science and Technology.*|.*University of Illinois at Urbana-Champaign.*|.*University of Illinois.*|.*Urbana-Champaign.*|.*Urbana.*|.*Champaign.*|.*Nanyang Technological University .*|.*Australian National University.*", ignore.case = TRUE, x = founders$institution)

top25 = rep(0, nrow(founders))
top50 = rep(0, nrow(founders))
top25[top1] = 1
top50[top2] = 1
founders["top25"] = top25
founders["top50"] = top50
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id, person_id = founders$person_id), FUN = max)
peeps = aggregate(x = founders[,c("company_id", "person_id")], by = list(company_id = founders$company_id), FUN = counts)
peeps = peeps[,c(1,2)]
colnames(peeps)[2] = "people"
founders = aggregate(x = founders[,c("top25", "top50")], by = list(company_id = founders$company_id), FUN = sum)
founders = merge(founders, peeps[,c(1,2)], by="company_id")
founders$top25pct = round(100*founders$top25/founders$people)
founders$top50pct = round(100*founders$top50/founders$people)
founders = na.omit(founders[,-c(2,3,4)])

Final Dataset For Analysis

After extracting the individual data aspects (people, company, investments) of the companies, these aspects are joined into a single dataset. The dataset that will be used for this project will depend on the resulting row numbers after different combination of joins.

company_aq = merge(companies, aquisitions, by = "company_id")
#nrow(company_aq)
company_aq_lds = merge(x = company_aq, y = founders, by = "company_id", all.x = TRUE)
#nrow(company_aq_lds)

Preminiary Analysis

To ensure that the dataset is viable for linear regression analysis, we create a preliminary linear model with company aquisition price as the response variable, and several other potential columns in the startups dataset as predictors. We are using the company_aq dataset above.

# Data set to be analized, filtering out aquisition values without aquisition price
startups = company_aq_lds[company_aq_lds$aquisition_price_amount>0, ]
nrow(startups)
col_remove = c("entity_type", "entity_id", "parent_id", "name", "permalink", "founded_at", "closed_at", "domain", "homepage_url", "twitter_username", "logo_url", "short_description", "description", "overview", "tag_list", "region", "first_investment_at", "last_investment_at", "first_funding_at", "last_funding_at", "first_milestone_at", "last_milestone_at", "created_by", "created_at", "updated_at")

startups = startups[,-which(colnames(startups) %in% col_remove)]

#Nulls by column
show_nas = function(somedf){
  cols_nas = data.frame(column = colnames(somedf), numNAs = rep(0, ncol(somedf)))
  for(i in 1:ncol(somedf)){
    cols_nas[i, "numNAs"] = length(somedf[is.na(somedf[,i]),i])
  }
  cols_nas$pctNAs = round(100*cols_nas$numNAs / nrow(somedf), digits = 1)
  cols_nas
}
show_nas(startups)

We clean NA values below:

#clean NAs; introduce "Unknown" for categorical variables where ther are N/As
startups[is.na(startups[,"category_code"]),"category_code"] = "unknown"
startups[is.na(startups[,"country_code"]),"country_code"] = "unknown"
startups[is.na(startups[,"city"]),"city"] = "unknown"
startups[is.na(startups[,"state_code"]),"state_code"] = "unknown"

#For numeric variablestop25pct, top50pct; will convert all of these into a categorical variable that says if people in company attended universities in top 25 universities in worl, then 1. Similarly, if company people (executives), attended universites in the top 26 to 50 universites `istop26_50`, 1; else, 0.
startups[is.na(startups[,"top25pct"]),"top25pct"] = 0
startups[is.na(startups[,"top50pct"]),"top50pct"] = 0
startups$istop25 = ifelse(startups$top25pct > 0, 1, 0)
startups$istop26_50 = ifelse(startups$top50pct > 0, 1, 0)

#Add a few additional predictors
startups$name_length = nchar(startups$normalized_name)

show_nas(startups)

With clean data, we can now perform the data split for training and data

#Generate the full data
write_csv(x = startups, path = "./startups.csv")

set.seed(420)
train_index = sample(1:nrow(startups), size = floor(0.8 * nrow(startups)))
startups_tr = startups[train_index,]
startups_te = startups[-train_index,]

#Save the train and test datasest
write_csv(x = startups_tr, path = "./startups_tr.csv")
write_csv(x = startups_te, path = "./startups_te.csv")

The dataset to be used, startups has 2472 rows of data.

A sample output of the data is below:

aquisition_price_amount city funding_total_usd category_code funding_rounds milestones relationships
20000000 other 0 games_video 0 0 6
47500000 mountain view 5000000 web 1 3 14
15000000 san francisco 3000000 games_video 1 2 6
262500000 other 274999 hardware 1 0 4
36500000 other 15286415 other 3 1 2
100000000 other 28000000 hardware 1 3 7

Moreover, a very simple model illustrates feasibility for linear regression analysis:

#Linear Model
company_model = lm(log(aquisition_price_amount) ~ log(funding_total_usd+1) + category_code + funding_rounds + milestones + relationships + istop25 + istop26_50, data=startups)
summary(company_model)
## 
## Call:
## lm(formula = log(aquisition_price_amount) ~ log(funding_total_usd + 
##     1) + category_code + funding_rounds + milestones + relationships + 
##     istop25 + istop26_50, data = startups)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3812  -1.1325   0.2108   1.6253   8.0076 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   16.574302   0.300837  55.094  < 2e-16 ***
## log(funding_total_usd + 1)     0.002897   0.011861   0.244 0.807083    
## category_codebiotech           1.422294   0.336273   4.230 2.43e-05 ***
## category_codeenterprise        0.240485   0.378627   0.635 0.525389    
## category_codegames_video      -0.010138   0.403346  -0.025 0.979949    
## category_codehardware          0.628013   0.417439   1.504 0.132596    
## category_codemobile            0.174453   0.378273   0.461 0.644708    
## category_codenetwork_hosting   0.306121   0.438431   0.698 0.485107    
## category_codeother             0.781903   0.319944   2.444 0.014601 *  
## category_codepublic_relations  0.778358   0.422489   1.842 0.065549 .  
## category_codesemiconductor     0.873731   0.428455   2.039 0.041531 *  
## category_codesoftware          0.420082   0.318961   1.317 0.187951    
## category_codeunknown          -1.909558   0.320396  -5.960 2.89e-09 ***
## category_codeweb              -0.207972   0.333049  -0.624 0.532391    
## funding_rounds                -0.108816   0.071172  -1.529 0.126414    
## milestones                     0.171153   0.059412   2.881 0.004001 ** 
## relationships                  0.024988   0.004549   5.494 4.35e-08 ***
## istop25                        1.202162   0.137580   8.738  < 2e-16 ***
## istop26_50                     0.627238   0.168145   3.730 0.000196 ***
## ---
<<<<<<< HEAD
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
=======
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1
## 
## Residual standard error: 2.637 on 2453 degrees of freedom
## Multiple R-squared:  0.2632, Adjusted R-squared:  0.2578 
## F-statistic: 48.69 on 18 and 2453 DF,  p-value: < 2.2e-16
  • \(R^2= 0.263225\)
  • \(\text{p-value} = 5.9345821\times 10^{-148}\)

In addition, we can see how the linear regression assumptions of normality and constant variance are with the simple model. In the study, will better adjust tthe model in question to more appropriately describe the data with the model including its variance.

#Plots of the model
par(mfrow = c(1,2))
plot(company_model, which = 1:2)
<<<<<<< HEAD

=======

>>>>>>> 9fb3f8a7077bbcc5f0845e9546f51b9c2f2851e1